DALI Lab Data Challenge 22W, Task 2: Modeling

Jeremy Hadfield, 3.2.2022

Chosen Dataset: Mental Health in Tech Survey

This dataset comes from a 2014 survey of tech workers about mental health. It was acquired from OSMI (Open Sourcing Mental Illness) here: https://osmihelp.org/research, and has 1400 respondents. I'm passionate about mental health and fascinated by applying data science and machine learning to this area, so I decided to analyze this dataset and build machine learning models that predict mental health outcomes from demographic variables and workplace factors.

Possible Questions:

Reading in the Data

Data Cleaning

Basic Data Exploration and Visualization

Possible feature variables (used to predict)

Demographics

Employment

Possible target variables (outcomes to predict)

Above: Pie chart demonstrating gender disparities in tech. This survey seems representative of disparities in the tech industry as a whole, as nearly 80% of respondents are male.

Above: Distribution of survey respondents by country. The vast majority of respondents are in the US or UK, and the US has by far the most tech company respondents.

Table above: Proportions of care options at tech and non-tech companies. Most respondents (81%) work at a tech company, while only 18% work outside of tech. Treatment percentages do not vary much between in-tech and out-of-tech workers. However, there does seem to be a notable difference in observed negative consequences for mental health conditions: out-of-tech workers more often report negative consequences in their workplace, while only 14% of tech workers report the same.
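A proportion table like the one described above can be built with `pandas.crosstab`. The sketch below is only an illustration: the rows are invented, and the column names `tech_company` and `treatment` are assumed to match the survey's columns.

```python
import pandas as pd

# Toy stand-in for the survey; these rows are invented for illustration.
df = pd.DataFrame({
    "tech_company": ["Yes", "Yes", "Yes", "Yes", "No", "No"],
    "treatment":    ["Yes", "No",  "Yes", "No",  "Yes", "No"],
})

# Share of respondents inside vs. outside tech companies.
tech_share = df["tech_company"].value_counts(normalize=True)

# Row-normalized table: within each tech_company group, the fraction
# answering Yes/No to treatment. Each row sums to 1.
treatment_by_tech = pd.crosstab(
    df["tech_company"], df["treatment"], normalize="index"
)

print(tech_share)
print(treatment_by_tech)
```

Normalizing by row (`normalize="index"`) is what makes the in-tech and out-of-tech groups comparable even though one group is much larger than the other.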

Data Modeling: Machine Learning

Model 1: Is whether a worker is in a tech company a good predictor of whether they seek mental health treatment?

Here, I use a Random Forest Classifier (which aggregates the predictions of many decision trees) to predict whether a worker seeks treatment for a mental health condition, using only whether they work at a tech company as the feature.
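A minimal sketch of this kind of one-feature model, using scikit-learn. The data here is synthetic and random, the column names `tech_company` and `treatment` are assumptions, and the split ratio and hyperparameters are illustrative choices, not necessarily the ones used for the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey (random values, not real responses).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "tech_company": rng.choice(["Yes", "No"], size=n, p=[0.8, 0.2]),
    "treatment": rng.choice(["Yes", "No"], size=n),
})

# Encode the single Yes/No feature and the Yes/No target as 0/1.
X = (df[["tech_company"]] == "Yes").astype(int)
y = (df["treatment"] == "Yes").astype(int)

# Hold out part of the data so accuracy is measured on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

With one binary feature, every respondent with the same `tech_company` value receives the same prediction, so the forest can produce at most two distinct outputs.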

Results

The model achieved an accuracy of 46.82%, which does not appear to be better than chance. Further, the confusion matrix shows 134 false positives against 118 true negatives, so the model could not reliably predict when a worker sought treatment based on whether they were in tech or not. This is not surprising: with a single binary feature, the model can only make one prediction for all tech workers and one for all non-tech workers, which is not enough information to predict the outcome accurately.
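For reference, accuracy and the four confusion-matrix cells can be computed with scikit-learn as sketched below. The labels are a tiny invented example, not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels for illustration: 1 = sought treatment, 0 = did not.
y_true = [1, 0, 1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]

# For binary 0/1 labels, scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]], so ravel() yields the cells in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy is the share of predictions on the diagonal (TN + TP).
acc = accuracy_score(y_true, y_pred)

print(f"TN={tn} FP={fp} FN={fn} TP={tp} accuracy={acc:.2%}")
```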

Model 2: Machine Learning Models with More Features

Here, I add more features to see whether they improve predictive accuracy. I also encode the categorical features as numbers so the model can use them.
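One possible way to do this encoding is scikit-learn's `OrdinalEncoder`, sketched below. This is an assumption about the approach rather than a record of the exact encoder used here, and the rows are invented; the column names are assumed to match the survey.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented rows; column names are assumed to match the survey.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "family_history": ["No", "Yes", "Yes", "No"],
    "benefits": ["Yes", "Don't know", "No", "Yes"],
})

# OrdinalEncoder maps each category to an integer code per column.
# By default the categories are sorted, so "Female" -> 0 and "Male" -> 1.
encoder = OrdinalEncoder()
encoded = pd.DataFrame(encoder.fit_transform(df), columns=df.columns)

print(encoded)
print(encoder.categories_)
```

Integer codes imply an ordering that the categories may not really have. Tree-based models usually tolerate this, but one-hot encoding (for example with `pandas.get_dummies`) is a common alternative when the ordering is a concern.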

Results interpretation: This model yields much stronger and more interpretable results. Its accuracy, the percentage of predictions that were correct, is 98.68%. The confusion matrix shows 188 true positives, 185 true negatives, and only 5 false negatives. This is an extremely accurate model.
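The reported numbers can be cross-checked with a little arithmetic. The sketch below takes the three counts and the accuracy stated above, finds the false-positive count that best matches them, and derives recall; the helper name `accuracy` and the search range are my own illustrative choices.

```python
# Counts and accuracy as reported in the text above.
tp, tn, fn = 188, 185, 5
reported_accuracy = 0.9868


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Try small false-positive counts and keep the one whose accuracy
# is closest to the reported figure.
implied_fp = min(
    range(0, 20),
    key=lambda fp: abs(accuracy(tp, tn, fp, fn) - reported_accuracy),
)

# Recall: of the workers who actually sought treatment, the share found.
recall = tp / (tp + fn)

print(f"implied false positives: {implied_fp}")
print(f"accuracy: {accuracy(tp, tn, implied_fp, fn):.2%}")
print(f"recall: {recall:.2%}")
```

The best match is zero false positives, since 373 correct predictions out of 378 rounds to 98.68%.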

The graph above also ranks each feature in the model by its importance. Demographic factors like Age and Gender are the most important predictors of whether a person seeks mental health treatment. Another important factor is family_history of mental health conditions. Workplace factors, like mental health benefits, care options, and observing negative consequences for mental health conditions, rank considerably lower. However, they may still contribute to the model's accuracy.
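A ranking like this can be read from a fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data with neutral, made-up column names (`informative`, `noise_1`, `noise_2`), so its ranking says nothing about the survey itself.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: one feature drives the target, two are pure noise.
rng = np.random.default_rng(42)
n = 600
X = pd.DataFrame({
    "informative": rng.integers(0, 2, size=n),
    "noise_1": rng.integers(0, 2, size=n),
    "noise_2": rng.integers(0, 2, size=n),
})

# Target copies the informative feature, with about 10% of labels flipped.
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - X["informative"], X["informative"])

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
ranked = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(ranked)
```

These importances are impurity-based, and scikit-learn's documentation notes that they can favor features with many distinct values, which may be relevant for a numeric column like Age. `sklearn.inspection.permutation_importance` is one alternative worth comparing against.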